Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.[1]
Consensus protocols are the basis for the state machine approach to distributed computing, as suggested by Leslie Lamport[2] and surveyed by Fred Schneider.[3] The state machine approach is a technique for converting an algorithm into a fault-tolerant, distributed implementation. Ad-hoc techniques may leave important cases of failures unresolved. The principled approach proposed by Lamport et al. ensures all cases are handled safely.
The Paxos protocol was described and named in 1990, after a fictional legislative consensus system used in the Ionian Paxos islands of Greece.[4] It was widely circulated as a technical report in that period, but was not published until 1998.[5] The topic predates the protocol. For example, state machine replication is a common style of programming within the Virtual synchrony execution model, which was introduced in 1985.[6] The virtually synchronous membership agreement protocol published by Birman and Joseph in connection with the Isis system, in 1987, was very similar to the leader-based Paxos protocol shown below.[7] In 1988, Lynch, Dwork and Stockmeyer had demonstrated [8] the solvability of consensus in a broad family of "partially synchronous" systems. Paxos has strong similarities to a protocol used for agreement in viewstamped replication, first published by Oki and Liskov in 1988, in the context of distributed transactions.[9] Notwithstanding this prior work, Paxos offered a particularly elegant formalism, and included one of the earliest proofs of safety for a fault-tolerant distributed consensus protocol.
The Paxos family of protocols includes a spectrum of trade-offs between the number of processors, number of message delays before learning the agreed value, the activity level of individual participants, number of messages sent, and types of failures. Although no fault-tolerant consensus protocol can guarantee progress (a result proved in a paper by Fischer, Lynch and Paterson[10]), Paxos guarantees safety (freedom from inconsistency), and the conditions that could prevent it from making progress are difficult to provoke.[5][11][12][13][14]
In order to simplify the presentation of Paxos, the following assumptions and definitions are made explicit. Techniques to broaden the applicability are known in the literature, and are not covered in this article; please see references for further reading.
In general, a consensus algorithm can make progress using 2F+1 processors despite the simultaneous failure of any F processors.[15] However, using reconfiguration, a protocol may be employed which survives more total failures if they do not occur too rapidly:
Paxos describes the actions of the processes by their roles in the protocol: client, acceptor, proposer, learner, and leader. In typical implementations, a single processor may play one or more roles at the same time. This does not affect the correctness of the protocol—it is usual to coalesce roles to improve the latency and/or number of messages in the protocol.
Quorums express the safety properties of Paxos by ensuring at least some surviving processor retains knowledge of the results.
Quorums are defined as subsets of the set of Acceptors such that any two subsets (that is, any two Quorums) share at least one member. Typically, a Quorum is any majority of participating Acceptors. For example, given the set of Acceptors {A,B,C,D}, a majority Quorum would be any three Acceptors: {A,B,C}, {A,C,D}, {A,B,D}, {B,C,D}. More generally, arbitrary positive weights can be assigned to Acceptors and a Quorum defined as any subset of Acceptors with the summary weight greater than half of the total weight of all Acceptors.
Each attempt to define an agreed value v is performed with proposals which may or may not be accepted by Acceptors. Each proposal is uniquely numbered for a given Proposer. The value corresponding to a numbered proposal can be computed as part of running the Paxos protocol, but does not have to.
In order to guarantee safety, Paxos defines three safety properties and ensures they are always held, regardless of the pattern of failures:
In most deployments of Paxos, each participating process acts in three roles; Proposer, Acceptor and Learner.[16] This reduces the message complexity significantly, without sacrificing correctness:
“ | In Paxos, clients send commands to a leader. During normal operation, the leader receives a client's command, assigns it a new command number i, and then begins the ith instance of the consensus algorithm by sending messages to a set of acceptor processes.[13] | ” |
By merging roles, the protocol "collapses" into an efficient client-master-replica style deployment, typical of the database community. The benefit of the Paxos protocols (including implementations with merged roles) is the guarantee of its safety properties.
A typical implementation's message flow is covered in the section Typical Multi-Paxos deployment.
This protocol is the most basic of the Paxos family. Each instance of the Basic Paxos protocol decides on a single output value. The protocol proceeds over several rounds. A successful round has two phases. A Proposer should not initiate Paxos if it cannot communicate with at least a Quorum of Acceptors:
(first round is successful)
Client Proposer Acceptor Learner | | | | | | | X-------->| | | | | | Request | X--------->|->|->| | | Prepare(1) | |<---------X--X--X | | Promise(1,{null,null,null}) | X--------->|->|->| | | Accept!(1,V) | |<---------X--X--X------>|->| Accepted(1,V) |<---------------------------------X--X Response | | | | | | |
The simplest error cases are the failure of a redundant Learner, or failure of an Acceptor when a Quorum of Acceptors remains live. In these cases, the protocol requires no recovery. No additional rounds or messages are required, as shown below:
(Quorum size = 2 Acceptors)
Client Proposer Acceptor Learner | | | | | | | X-------->| | | | | | Request | X--------->|->|->| | | Prepare(1) | | | | ! | | !! FAIL !! | |<---------X--X | | Promise(1,{null, null}) | X--------->|->| | | Accept!(1,V) | |<---------X--X--------->|->| Accepted(1,V) |<---------------------------------X--X Response | | | | | |
Client Proposer Acceptor Learner | | | | | | | X-------->| | | | | | Request | X--------->|->|->| | | Prepare(1) | |<---------X--X--X | | Promise(1,{null,null,null}) | X--------->|->|->| | | Accept!(1,V) | |<---------X--X--X------>|->| Accepted(1,V) | | | | | | ! !! FAIL !! |<---------------------------------X Response | | | | | |
The next failure case is when a Proposer fails after proposing a value, but before agreement is reached. Ignoring Leader election, an example message flow is as follows:
(re-election not shown, one instance, two rounds)
Client Leader Acceptor Learner | | | | | | | X----->| | | | | | Request | X------------>|->|->| | | Prepare(1) | |<------------X--X--X | | Promise(1,{null, null, null}) | | | | | | | | | | | | | | !! Leader fails during broadcast !! | X------------>| | | | | Accept!(1,Va) | ! | | | | | | | | | | | | !! NEW LEADER !! | X--------->|->|->| | | Prepare(2) | |<---------X--X--X | | Promise(2,{null, null, null}) | X--------->|->|->| | | Accept!(2,V) | |<---------X--X--X------>|->| Accepted(2,V) |<---------------------------------X--X Response | | | | | | |
The most complex case is when multiple Proposers believe themselves to be Leaders. For instance the current leader may fail and later recover, but the other Proposers have already re-elected a new leader. The recovered leader has not learned this yet and attempts to begin a round in conflict with the current leader.
(one instance, four unsuccessful rounds)
Client Proposer Acceptor Learner | | | | | | | X----->| | | | | | Request | X------------>|->|->| | | Prepare(1) | |<------------X--X--X | | Promise(1,{null,null,null}) | ! | | | | | !! LEADER FAILS | | | | | | | !! NEW LEADER (knows last number was 1) | X--------->|->|->| | | Prepare(2) | |<---------X--X--X | | Promise(2,{null,null,null}) | | | | | | | | !! OLD LEADER recovers | | | | | | | | !! OLD LEADER tries 2, denied | X------------>|->|->| | | Prepare(2) | |<------------X--X--X | | Nack(2) | | | | | | | | !! OLD LEADER tries 3 | X------------>|->|->| | | Prepare(3) | |<------------X--X--X | | Promise(3,{null,null,null}) | | | | | | | | !! NEW LEADER proposes, denied | | X--------->|->|->| | | Accept!(2,Va) | | |<---------X--X--X | | Nack(3) | | | | | | | | !! NEW LEADER tries 4 | | X--------->|->|->| | | Prepare(4) | | |<---------X--X--X | | Promise(4,{null,null,null}) | | | | | | | | !! OLD LEADER proposes, denied | X------------>|->|->| | | Accept!(3,Vb) | |<------------X--X--X | | Nack(4) | | | | | | | | ... and so on ...
A typical deployment of Paxos requires a continuous stream of agreed values acting as commands to a distributed state machine. If each command is the result of a single instance of the Basic Paxos protocol, a significant amount of overhead would result.
If the leader is relatively stable, phase 1 becomes unnecessary. Thus, it is possible to skip phase 1 for future instances of the protocol with the same leader.
To achieve this, the instance number I is included along with each value. Multi-Paxos reduces the failure-free message delay (proposal to learning) from 4 delays to 2 delays.
(first instance with new leader)
Client Proposer Acceptor Learner | | | | | | | --- First Request --- X-------->| | | | | | Request | X--------->|->|->| | | Prepare(N) | |<---------X--X--X | | Promise(N,I,{Va,Vb,Vc}) | X--------->|->|->| | | Accept!(N,I,Vm) | |<---------X--X--X------>|->| Accepted(N,I,Vm) |<---------------------------------X--X Response | | | | | | |
Vm = last of (Va, Vb, Vc)
(subsequent instances with same leader)
Client Proposer Acceptor Learner | | | | | | | --- Following Requests --- X-------->| | | | | | Request | X--------->|->|->| | | Accept!(N+1,I,W) | |<---------X--X--X------>|->| Accepted(N+1,I,W) |<---------------------------------X--X Response | | | | | | |
The most common deployment of the Paxos family is Multi-Paxos,[16] specialized for participating processors to each be Proposers, Acceptors and Learners. The message flow may be optimized as depicted here:
(first instance with new leader)
Client Servers | | | | --- First Request --- X-------->| | | Request | X->|->| Prepare(N) | |<-X--X Promise(N,I,{Va,Vb,Vc}) | X->|->| Accept!(N,I,Vn) | |<-X--X Accepted(N,I) |<--------X | | Response | | | |
(subsequent instances with same leader)
Client Servers X-------->| | | Request | X->|->| Accept!(N+1,I,W) | |<-X--X Accepted(N+1,I) |<--------X | | Response | | | |
A number of optimizations reduce message complexity and size. These optimizations are summarized below:
Cheap Paxos extends Basic Paxos to tolerate F failures with F+1 main processors and F auxiliary processors by dynamically reconfiguring after each failure.
This reduction in processor requirements comes at the expense of liveness; if too many main processors fail in a short time, the system must halt until the auxiliary processors can reconfigure the system. During stable periods, the auxiliary processors take no part in the protocol.
3 main Acceptors, 1 Auxiliary Acceptor, Quorum size = 3, showing failure of one main processor and subsequent reconfiguration
{ Acceptors } Proposer Main Aux Learner | | | | | | -- Phase 2 -- X----------->|->|->| | | Accept!(N,I,V) | | | ! | | --- FAIL! --- |<-----------X--X--------------->| Accepted(N,I,V) | | | | | -- Failure detected (only 2 accepted) -- X----------->|->|------->| | Accept!(N,I,V) (re-transmit, include Aux) |<-----------X--X--------X------>| Accepted(N,I,V) | | | | | -- Reconfigure : Quorum = 2 -- X----------->|->| | | Accept!(N,I+1,W) (Aux not participating) |<-----------X--X--------------->| Accepted(N,I+1,W) | | | | |
Fast Paxos generalizes Basic Paxos to reduce end-to-end message delays. In Basic Paxos, the message delay from client request to learning is 3 message delays. Fast Paxos allows 2 message delays, but requires the Client to send its request to multiple destinations.
Intuitively, if the leader has no value to propose, then a client could send an Accept! message to the Acceptors directly. The Acceptors would respond as in Basic Paxos, sending Accepted messages to the leader and every Learner achieving two message delays from Client to Learner.
If the leader detects a collision, it resolves the collision by sending Accept! messages for a new round which are Accepted as usual. This coordinated recovery technique requires four message delays from Client to Learner.
The final optimization occurs when the leader specifies a recovery technique in advance, allowing the Acceptors to perform the collision recovery themselves. Thus, uncoordinated collision recovery can occur in three message delays (and only two message delays if all Learners are also Acceptors).
Client Leader Acceptor Learner | | | | | | | | | X--------->|->|->|->| | | Any(N,I,Recovery) | | | | | | | | X------------------->|->|->|->| | | Accept!(N,I,W) | |<---------X--X--X--X------>|->| Accepted(N,I,W) |<------------------------------------X--X Response(W) | | | | | | | |
Conflicting proposals with uncoordinated recovery. Note: the protocol does not specify how to handle the dropped client request.
Client Leader Acceptor Learner | | | | | | | | | | | X------->|->|->|->| | | Any(N,I,Recovery) | | | | | | | | | | | | | | | | | | !! Concurrent conflicting proposals | | | | | | | | | !! received in different order | | | | | | | | | !! by the Acceptors | X--------------?|-?|-?|-?| | | Accept!(N,I,V) X-----------------?|-?|-?|-?| | | Accept!(N,I,W) | | | | | | | | | | | | | | | | | | !! Acceptors disagree on value | | |<-------X--X->|->|----->|->| Accepted(N,I,V) | | |<-------|<-|<-X--X----->|->| Accepted(N,I,W) | | | | | | | | | | | | | | | | | | !! Detect collision & recover | | |<-------X--X--X--X----->|->| Accepted(N+1,I,W) |<---------------------------------X--X Response(W) | | | | | | | | |
(merged Acceptor/Learner roles)
Client Servers | | | | | | | | X->|->|->| Any(N,I,Recovery) | | | | | | | | | | | | !! Concurrent conflicting proposals | | | | | | !! received in different order | | | | | | !! by the Servers | X--------?|-?|-?|-?| Accept!(N,I,V) X-----------?|-?|-?|-?| Accept!(N,I,W) | | | | | | | | | | | | !! Servers disagree on value | | X--X->|->| Accepted(N,I,V) | | |<-|<-X--X Accepted(N,I,W) | | | | | | | | | | | | !! Detect collision & recover |<-----------X--X--X--X Response(W) | | | | | |
Generalized consensus explores the relationship between the operations of a distributed state machine and the consensus protocol used to maintain consistency of that state machine. The main discovery involves optimizations of the consensus protocol when conflicting proposals could be applied to the state machine in any order. i.e.: The operations proposed by the conflicting proposals are commutative operations of the state machine.
In such cases, the conflicting operations can both be accepted, avoiding the delays required for resolving conflicts and re-proposing the rejected operation.
This concept is further generalized into ever-growing sets of commutative operations, some of which are known to be stable (and thus may be executed). The protocol tracks these sets of operations, ensuring that all proposed commutative operations of one set are stabilized before allowing any non-commuting operation to become stable.
Read(A) | Write(A) | Read(B) | Write(B) | |
---|---|---|---|---|
Read(A) | X | |||
Write(A) | X | X | ||
Read(B) | X | |||
Write(B) | X | X |
1:Read(A) 2:Read(B) 3:Write(B) 4:Read(B) 5:Read(A) 6:Write(A) 7:Read(A)
{ 1:Read(A), 2:Read(B), 5:Read(A) } { 3:Write(B), 6:Write(A) } { 4:Read(B), 7:Read(A) }
5:Read(A)
may commute in front of 3:Write(B)/4:Read(B)
pair.4:Read(B)
may commute behind the 3:Write(B)/6:Write(A)
pair.Responses not shown. Note: message abbreviations differ from previous message flows due to specifics of the protocol, see [13] for a full discussion.
{ Acceptors } Client Leader Acceptor Learner | | | | | | | | !! New Leader Begins Round | | X----->|->|->| | | Prepare(N) | | |<-----X--X--X | | Promise(N,null) | | X----->|->|->| | | Phase2Start(N,null) | | | | | | | | | | | | | | | | !! Concurrent commuting proposals | X--------?|-----?|-?|-?| | | Propose(ReadA) X-----------?|-----?|-?|-?| | | Propose(ReadB) | | X------X-------------->|->| Accepted(N,<ReadA,ReadB>) | | |<--------X--X-------->|->| Accepted(N,<ReadB,ReadA>) | | | | | | | | | | | | | | | | !! No Conflict, both accepted | | | | | | | | Stable = <ReadA, ReadB> | | | | | | | | | | | | | | | | !! Concurrent conflicting proposals X-----------?|-----?|-?|-?| | | Propose(<WriteB,ReadA>) | X--------?|-----?|-?|-?| | | Propose(ReadB) | | | | | | | | | | X------X-------------->|->| Accepted(N,<WriteB,ReadA> . <ReadB>) | | |<--------X--X-------->|->| Accepted(N,<ReadB> . <WriteB,ReadA>) | | | | | | | | | | | | | | | | !! Conflict detected, leader chooses | | | | | | | | commutative order: | | | | | | | | V = <ReadA, WriteB, ReadB> | | | | | | | | | | X----->|->|->| | | Phase2Start(N+1,V) | | |<-----X--X--X-------->|->| Accepted(N+1,V) | | | | | | | | Stable = <ReadA, ReadB> . | | | | | | | | <ReadA, WriteB, ReadB> | | | | | | | | | | | | | | | | !! More conflicting proposals X-----------?|-----?|-?|-?| | | Propose(WriteA) | X--------?|-----?|-?|-?| | | Propose(ReadA) | | | | | | | | | | X------X-------------->|->| Accepted(N+2,<WriteA> . <ReadA>) | | |<--------X--X-------->|->| Accepted(N+2,<ReadA> . <WriteA>) | | | | | | | | | | | | | | | | !! Leader chooses order W | | X----->|->|->| | | Phase2Start(N+2,W) | | |<-----X--X--X-------->|->| Accepted(N+2,W) | | | | | | | | Stable = <ReadA, ReadB> . | | | | | | | | <ReadA, WriteB, ReadB> . | | | | | | | | <WriteA, ReadA> | | | | | | | |
Paxos may also be extended to support arbitrary failures of the participants, including lying, fabrication of messages, collusion with other participants, selective non-participation, etc. These types of failures are called Byzantine failures, after the solution popularized by Lamport.[18]
Byzantine Paxos[12][14] adds an extra message (Verify) which acts to distribute knowledge and verify the actions of the other processors:
Client Proposer Acceptor Learner | | | | | | | X-------->| | | | | | Request | X--------->|->|->| | | Accept!(N,I,V) | | X<>X<>X | | Verify(N,I,V) - BROADCAST | |<---------X--X--X------>|->| Accepted(N,V) |<---------------------------------X--X Response(V) | | | | | | |
Fast Byzantine Paxos removes this extra delay, since the client sends commands directly to the Acceptors.[12]
Note the Accepted message in Fast Byzantine Paxos is sent to all Acceptors and all Learners, while Fast Paxos sends Accepted messages only to Learners):
Client Acceptor Learner | | | | | | X----->|->|->| | | Accept!(N,I,V) | X<>X<>X------>|->| Accepted(N,I,V) - BROADCAST |<-------------------X--X Response(V) | | | | | |
The failure scenario is the same for both protocols; Each Learner waits to receive F+1 identical messages from different Acceptors. If this does not occur, the Acceptors themselves will also be aware of it (since they exchanged each other's messages in the broadcast round), and correct Acceptors will re-broadcast the agreed value:
Client Acceptor Learner | | | ! | | !! One Acceptor is faulty X----->|->|->! | | Accept!(N,I,V) | X<>X<>X------>|->| Accepted(N,I,{V,W}) - BROADCAST | | | ! | | !! Learners receive 2 different commands | | | ! | | !! Correct Acceptors notice error and choose | X<>X<>X------>|->| Accepted(N,I,V) - BROADCAST |<-------------------X--X Response(V) | | | ! | |
Although the virtual synchrony model and Paxos are sometimes viewed as competing options for creating state machine replicated services and similar applications, recent work[19] has demonstrated the possibility of combining the two models into a single unified solution. This approach uses virtual synchrony to support execution epochs, within which membership is determined by a consensus protocol and stable. Fault-tolerant protocols can then be defined on a per-epoch basis. The approach offers better performance than other options for making the membership of a Paxos service dynamic, while strengthening the virtual synchrony model.